


Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 
DSpace Repository 




1997-03 


Parallel processing performance evaluation of 
mixed T10/T100 Ethernet topologies on Linux 
Pentium systems 


Decato, Steven W. 


Monterey, California. Naval Postgraduate School 
http://hdl.handle.net/10945/8798 


This publication is a work of the U.S. Government as defined in Title 17, United 
States Code, Section 101. Copyright protection is not available for this work in the 
United States. 




NAVAL POSTGRADUATE SCHOOL 
Monterey, California 





THESIS 


PARALLEL PROCESSING PERFORMANCE 
EVALUATION OF MIXED T10/T100 ETHERNET 
TOPOLOGIES ON LINUX PENTIUM SYSTEMS 

by 
Steven W. Decato 


March 1997 


Thesis Advisor: Bert Lundy 





Approved for public release; distribution is unlimited. 





REPORT DOCUMENTATION PAGE          Form Approved OMB No. 0704-0188 


Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data 
sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other 
aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and 
Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) 
Washington DC 20503. 


1. AGENCY USE ONLY (Leave blank) 
2. REPORT DATE: March 1997 
3. REPORT TYPE AND DATES COVERED: Master’s Thesis 
4. TITLE AND SUBTITLE: Parallel Processing Performance Evaluation of Mixed T10/T100 Ethernet Topologies on Linux Pentium Systems 
5. FUNDING NUMBERS 
6. AUTHOR(S): Steven W. Decato 
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Naval Postgraduate School, Monterey CA 93943-5000 
8. PERFORMING ORGANIZATION REPORT NUMBER 
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 
10. SPONSORING/MONITORING AGENCY REPORT NUMBER 
11. SUPPLEMENTARY NOTES: The views expressed in this thesis are those of the author and do not reflect the 
official policy or position of the Department of Defense or the U.S. Government. 
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited. 
12b. DISTRIBUTION CODE 


13. ABSTRACT (maximum 200 words) 

The intent of this thesis is to answer the question of whether real-time battlefield visualization, once 
requiring high-speed UNIX workstations and specialized parallel processors, can now be performed on 
relatively inexpensive off-the-shelf components. 

Alternative network topologies were implemented using 10 and 100 megabit-per-second Ethernet cards 
under the Linux operating system on Pentium based personal computer platforms. Network throughput, processor 
and video performance benchmark routines were developed to assess the hardware’s potential for parallel application 
in a distributed environment. Code was first ported to the Linux environment. Benchmark routines were then 
developed and tested on various machines. 

Dual 200 MHz Pentium Pro processor performance exceeded that of the dual processor 50 MHz SUN and 40 MHz 
SGI UNIX workstations currently used for terrain generation by a factor of 30, using a simple ray-trace algorithm as 
a basis for comparison. The Intel Pentium Pro personal computer proved to be a capable platform for generating six 
to ten frame-per-second terrain simulations. However, Fast Ethernet throughput averaged only 2.5 megabytes per 
second, thereby limiting the usefulness of a distributed approach designed to increase performance by dividing 
workload across the network. 


14. SUBJECT TERMS: Simulation, Perspective View Generation, Benchmarks, Performance, 
Parallel, Clusters, MPI 
15. NUMBER OF PAGES 
16. PRICE CODE 
17. SECURITY CLASSIFICATION OF REPORT: Unclassified 
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified 
19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified 
20. LIMITATION OF ABSTRACT 


NSN 7540-01-280-5500          Standard Form 298 (Rev. 2-89) 
Prescribed by ANSI Std. 239-18 298-102 





Approved for public release; distribution is unlimited. 


PARALLEL PROCESSING PERFORMANCE EVALUATION OF MIXED 
T10/T100 ETHERNET TOPOLOGIES ON LINUX PENTIUM SYSTEMS 


Steven W. Decato 
Major, United States Army 
B.A., University of South Florida, 1985 


Submitted in partial fulfillment 
of the requirements for the degree of 


MASTER OF SCIENCE IN COMPUTER SCIENCE 
from the 


NAVAL POSTGRADUATE SCHOOL 
March 1997 


ABSTRACT 


The intent of this thesis is to answer the question of whether real-time 
battlefield visualization, once requiring high-speed UNIX workstations and specialized 
parallel processors, can now be performed on relatively inexpensive off-the-shelf 
components. 

Alternative network topologies were implemented using 10 and 100 megabit-per-second 
Ethernet cards under the Linux operating system on Pentium based personal 
computer platforms. Network throughput, processor and video performance benchmark 
routines were developed to assess the hardware’s potential for parallel application in a 
distributed environment. Code was first ported to the Linux environment. Benchmark 
routines were then developed and tested on various machines. 

Dual 200 MHz Pentium Pro processor performance exceeded that of the dual processor 
50 MHz SUN and 40 MHz SGI UNIX workstations currently used for terrain generation 
by a factor of 30, using a simple ray-trace algorithm as a basis for comparison. The Intel 
Pentium Pro personal computer proved to be a capable platform for generating six to ten 
frame-per-second terrain simulations. However, Fast Ethernet throughput averaged only 
2.5 megabytes per second, thereby limiting the usefulness of a distributed approach 
designed to increase performance by dividing workload across the network. 



I. INTRODUCTION 


A. PURPOSE 


The purpose of this research paper is to present the evaluation of Personal 
Computer (PC) clusters as a potential low cost alternative to UNIX Workstations for 
supporting real-time terrain modeling simulations. Our intent was to explore various 
network communications protocols and determine maximum network throughput as a 
basis for a parallel implementation of the PEGASUS Perspective View Generator. 
Benchmark programs were developed and results analyzed as a function of network 
topologies, number of processors, and link communication speeds. Although the majority 
of our work involved analysis of network performance, benchmark programs were also 
written to evaluate processor and video performance. Research included exploring the 
difficulties in selecting and installing appropriate hardware using the Linux operating 
system. Porting problems encountered in moving software from the UNIX workstation 
environment to an Intel PC based Motif/X Window platform were examined. The intent 
of this analysis is to answer the question of whether real-time battlefield visualization, 
once conducted in the lab on high-speed workstations and specialized parallel processors, 
can now be performed on relatively inexpensive off-the-shelf components. 


B. BACKGROUND 


In 1983 when the Ada programming language was first developed, the IBM 
Personal Computer XT (PC) had been in production just two years and was already being 
replaced by the IBM PC/AT. In the same year Rick Mascitti coined the name C++ for 
Bjarne Stroustrup’s “C with Classes”. The Intel 80286 was the most advanced Intel CPU 
on the market having been introduced in 1982. In just fifteen years, performance of these 
low cost processors has blurred the computational benefits that have typically separated 
UNIX workstations from the personal computer. 

Today the development of high-speed local area networks, low cost CPUs running 
with clock speeds in excess of 200 megahertz, and the proliferation of inexpensive 
Personal Computers (PC) have expanded our vision of a distributed system. The 
possibility of thousands of networked computers working concurrently on a single task or 
cooperatively on many tasks is within our reach. In today’s climate of right sizing, 
smaller budgets are forcing purchasers to leverage more performance out of existing 
hardware. A PC based distributed approach offers the promise of low cost scalable 
performance improvements. 

The personal computer has also become a popular game platform due to recent 
gains in video display performance. Scores of simulation-type games put the player in 
the cockpit or on the ground navigating through a virtual world in real-time. Although 
video realism has improved dramatically through texturing, these games have done little 
in the way of placing the user in a photo-realistic and geographically accurate 
environment. 

This area of research is not limited by available geographical data. The United 
States Geological Survey is conducting a National Mapping Program (NMP) that 
includes the production of digital cartographic data and graphical maps. Figure 1 
illustrates the regions of the United States already included in the digital survey. 

Figure 1. National Mapping Program Progress To Date 


Researchers have been working in this area for several years at the US Army Test 
Facility at Ft. Hunter Liggett, California. They have developed a high fidelity, terrain 
accurate, simulation system known as PEGASUS. Starting with low-resolution satellite 
imagery and the corresponding terrain data from the Defense Mapping Agency, this 
system adds additional high-resolution photographic data sources to generate a terrain 
database accurate to one-meter resolution. 


With the closing of Ft. Hunter Liggett, work proceeded at the Naval Postgraduate 
School to build a world-wide database structure along with tools for integrating standard 
Defense Mapping Agency products (Baer, 1995). A perspective view generator was also 
developed which included an X-Window interface operating on a Silicon Graphics 
workstation. 

A parallel research effort conducted between the Naval Postgraduate School, 
TRADOC Analysis Command Monterey, and US Army TEXCOM was started to 
integrate all the legacy code developed on the SUN, Silicon Graphics, and Transputer 
platforms into a single low cost rapid terrain generation and perspective view generation 
workstation. This project, code named TELLUS, ported existing code from a proprietary 
parallel processor machine to a relatively portable X-Window environment. The work 
focused on extending the database generation procedures to lower resolution formats, 
thereby providing the capability of feeding virtual and constructive simulators. 

One underlying requirement was to identify the feasibility of moving the 
application to more powerful computer hardware capable of supporting the processing 
needs of realistic battlefield replication and rapid database generation without a 
significant cost increase. The most likely low cost candidates are systems built from 
commercial PC components. Personal computers have historically been unable to 
provide the performance necessary to support computationally intensive applications such 
as PEGASUS. Thus expensive workstations and specialized equipment were necessary 
to achieve reasonable performance. 

We believe recent performance improvements will allow low cost commercial 
components to provide the computing power required for high-resolution terrain database 
construction and real-time perspective view generation. To test this hypothesis we 
constructed a test-bed using two Pentium based PCs, several 486 PCs, and a 100BASE-T 
Fast Ethernet network, all operating under the Linux operating system. Using this 
equipment, we developed benchmark and test routines to estimate the performance of our 
simulation application. The results of the benchmarks outlined in this thesis are 
extremely encouraging. 

This report provides further details on the configuration, software environment, 
and benchmark tests performed to explore the suitability of using off-the-shelf PC 
equipment for parallel high-resolution battlefield simulation. 


C. RESEARCH QUESTIONS 


1. Primary Research Question 


To what extent can the PCI bus and new high performance peripherals deliver the 
performance necessary for battlefield simulation applications? 


2. Secondary Research Questions 


• What other performance considerations affect the suitability of the Personal 
Computer as a prospective platform for high-speed terrain modeling? 

• Can a distributed approach enhance performance of the Perspective View 
Generator? 


D. SCOPE 


This thesis focuses primarily on the use of Intel based Personal Computers running 
the Linux operating system. In the interest of time and expense we chose not to examine 
other processors such as those manufactured by DEC, AMD and Cyrix. 
Linux was chosen as the target operating system because of its low cost and its relatively 
minor differences from the UNIX variants running on high-end workstations. 


E. METHODOLOGY 


To achieve our goal, the research was divided into several tasks. First, hardware 
was purchased to host the Linux operating system. Linux was then installed and the 
network put into operation. Research was conducted to evaluate the availability and 
applicability of various communications Application Programming Interfaces (APIs). 
Once the APIs were chosen, various benchmark programs were written in C to test 
system performance. Finally, the results were recorded and analyzed to evaluate our 
research questions. 


F. ORGANIZATION 


Chapter II (Building a Network) provides an overview of the hardware acquired 
for the project. It discusses the aspects of the PC hardware architecture that affect the 
performance of the Perspective View Generator. Chapter II 
also examines processor, video and network limitations. 

Chapter III (Performance Analysis) presents the benchmark methods and 
programs developed to test PC performance using TCP/IP sockets and streams, FTP file 
transfers, and the Message Passing Interface (MPI) programming environment. 

Chapter IV (Summary, Conclusions and Recommendations) summarizes the 
findings of the research, answers the research questions, and presents recommendations 
for further research and study. 


G. BENEFITS OF STUDY 


Personal computer cost is a fraction of UNIX workstation cost. The benefits to 
the training and intelligence community are equally exciting. Given relatively recent data 
from the battlefield, such a system could provide intelligence analysts, pilots or soldiers 
the ability to safely walk or fly over terrain in a virtual environment. Near real-time 
intelligence data could also populate the view with enemy target locations and provide 
three dimensional battlefield visibility. Vehicles, equipment, and even aircraft could be 
dynamically tracked over an accurate view of the terrain based on digitized photographic 
and elevation data available from a multitude of sources. Graphical overlays could be 
placed over the terrain to give planners a feel for boundaries and identify key troop and 
re-supply locations. Additionally, the effects of current atmospheric conditions could be 
added to the scene to produce even more realistic images based on time and weather 
conditions. 

This thesis provides the building blocks for analyzing networked PC system 
performance. Our research constructs a performance model of a personal computer 
hosted system that we hope will soon be capable of handling such a scenario. 





II. BUILDING A NETWORK 


A. HARDWARE CONFIGURATION 


The TELLUS configuration is expected to support three main functions: 


• Analysis Functions - requiring standard low cost report generation, 
communication, mathematical and presentation tools. 

• Database Generation - requiring multiple seats, large image storage 
capacity and emphasizing graphic interactivity. 

• Battlefield Simulation - requiring supercomputer performance 
processing, real-time communication, and single function operation. 


The networked PC based design will eventually support all of these. The personal 
computer has historically supported the first function, serving as a relatively low cost, 
low speed platform for business applications, with the abundant software tools that have 
made the PC so successful. We have only recently turned to the personal computer for 
scientific database generation and battlefield simulation due to the emergence of fast 
32-bit processors connected to high-speed communications devices. Multiple processors can 
now share large image databases during interactive database creation. Likewise, these 
same processors may soon join forces and provide the processing power to support 
realistic real-time battlefield simulation. 

Figure 2 shows a block diagram of the design configuration for the TELLUS 
project. Although only two processors are shown in detail, the intent was to design a 
scalable networked system. The diagram is functionally organized with input devices on 
the left, computational network power in the center, and output devices on the right. This 
diagram represents the minimum system that can be used to explore networked parallel 
processing applications and to act as controller/peripheral device pair interfaces to the bulk 
of the computing power contained in the expanded network. 

Figure 2. TELLUS Hardware Configuration 


Cost is the main attraction of the PC based configuration. The price of a high-end 
personal computer system, such as the Pentium 90 network machine purchased for the 
project, has remained a constant $3000 since the IBM PC was first introduced. Only the 
capabilities and performance have increased. In November of 1995, when we first 
considered purchasing equipment for this research, the Pentium 133 was the fastest 
personal computer on the market. Since then processor performance has doubled, memory 
prices have plummeted from $35 per megabyte to $5 per megabyte, and IDE hard drive 
prices have dropped from $200 per gigabyte to $100 per gigabyte. We believe interface 
nodes with peripherals will continue to cost about $3000 while network computational 
nodes will cost under $1000 each. 
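
The cost estimate above lends itself to a short calculation. The following Python sketch is illustrative only: the function name and the optional per-node software fee are assumptions introduced here, and the node prices are simply the $3000 and $1000 figures quoted above (the latter being an upper bound).

```python
# Illustrative cost model; prices are the estimates quoted in the text.
INTERFACE_NODE_COST = 3000  # dollars per interface node with peripherals
COMPUTE_NODE_COST = 1000    # dollars per computational node (upper bound)


def cluster_cost(interface_nodes, compute_nodes, license_per_node=0):
    """Estimate total hardware cost plus any per-node software fee.

    license_per_node is a hypothetical parameter; it defaults to zero,
    which corresponds to an operating system with no licensing fee.
    """
    nodes = interface_nodes + compute_nodes
    hardware = (interface_nodes * INTERFACE_NODE_COST
                + compute_nodes * COMPUTE_NODE_COST)
    return hardware + nodes * license_per_node


if __name__ == "__main__":
    # Two interface nodes plus eight computational nodes.
    print(cluster_cost(2, 8))
```

Because the computational nodes dominate as the cluster grows, any per-node fee scales with them, which is why a fixed software charge per machine matters more for large clusters than for the two-node minimum system.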


B. THE OPERATING SYSTEM 


We chose the Linux/PC combination for the first attempt at a PC port of 
PEGASUS. This UNIX-like operating system has a large following among researchers 
and computer science students alike. Its ability to compile and run source code intended 
for UNIX Workstations is especially attractive. Furthermore, Linux is available at little 
or no cost. In contrast, SCO UNIX is commercially available for $400 per node. This 
additional cost seems prohibitive since it substantially increases total system cost as 
nodes are added. Linux provides the same functionality with no requirement for 
additional licensing fees as nodes are added. The decision to use Linux had one major 
drawback. When our research first began, not a single book existed on Linux in any local 
bookstore. Fortunately this information void has since been filled. Most major bookstores now 
have several books devoted to the operating system. Even Motif has been ported to the 
PC to replace the barren X Window desktop. 

Our first task was to obtain Linux and install it on a network of five computers: 
two Pentium 100 systems and three 486-66 systems. During the installation phase, we 
had a variety of hardware problems to contend with. Drivers were difficult to locate for 
our mixture of SCSI and IDE storage subsystems. Without a CD-ROM drive, we had to 
download Linux from Internet sites one floppy diskette at a time. The Pentium systems 
utilized Adaptec 2940W SCSI cards not directly supported under Linux at the time. The 
first of many challenges was to obtain a beta version of the driver mentioned briefly in an 
obscure HOWTO document on the Linux distribution CD-ROM. Once the source code 
for the driver was obtained, it was compiled into the operating system kernel. 

Maintaining a Linux network is time intensive and a thorough knowledge of 
UNIX system administration is required. Those of us with little experience in this area 
found that Linux presented an administrative burden that sometimes interfered with 
research. This is true of any UNIX implementation. 


C. THE PROGRAMMING INTERFACE 


Once the network was running under Linux, we scoured the Internet searching for 
available message passing libraries to ease the development of benchmark programs that 
would be used to test the potential speed of the network. The two most popular 
Application Programming Interfaces (APIs), TCP/IP sockets and streams, were included 
with Linux. We were able to find many sources of information on the topic. Benchmark 
programs were quickly developed and tested. 
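
A socket throughput benchmark of the kind described above might be structured as follows. This is a minimal Python sketch rather than the C benchmark code written for this research; the function name, the default sizes, and the use of the loopback address (so that the example runs on a single machine) are all assumptions.

```python
import socket
import threading
import time


def measure_tcp_throughput(total_bytes=1_000_000, chunk_size=8192,
                           host="127.0.0.1"):
    """Send total_bytes over one TCP connection.

    Returns (bytes_received, elapsed_seconds).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))            # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = []

    def sink():
        # Receiver: read until the sender closes the connection.
        conn, _ = server.accept()
        count = 0
        with conn:
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                count += len(data)
        received.append(count)

    receiver = threading.Thread(target=sink)
    receiver.start()

    payload = b"\0" * chunk_size
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        sent = 0
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()                   # wait until every byte has been read
    elapsed = time.perf_counter() - start
    server.close()
    return received[0], elapsed


if __name__ == "__main__":
    nbytes, seconds = measure_tcp_throughput()
    print(f"{nbytes / seconds / 1e6:.1f} megabytes per second")
```

To measure a real link rather than the loopback interface, the sender and receiver halves would run on separate machines; the timing logic stays the same. Stopping the clock only after the receiver has drained the connection avoids counting data that is still sitting in the sender's buffers.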

After evaluating the available libraries, we settled on the LAM implementation of the 
Message Passing Interface (MPI) for our secondary tests. An MPI standard has been 
established, and implementations have subsequently been ported to many platforms. For 
this reason, MPI appeared attractive as an 
interface that could be ported from one environment to another with relatively minor 
modifications. 


D. PARALLELISM AND CLUSTERING 


Perspective View Generation is a computationally intensive task requiring 
millions of calculations per second to support smooth frame rates. We were attracted to 
the idea of dividing the workload across network clusters in hopes that dividing the 
problem would result in faster screen refresh rates. The serial version of the program 
had already been optimized to the maximum extent possible. 
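
One simple way to divide a frame among several machines is to give each a contiguous strip of image rows. The Python sketch below shows only that bookkeeping; the strip decomposition and the function name are assumptions made for illustration and are not taken from the PEGASUS code.

```python
def split_rows(height, workers):
    """Divide rows 0..height-1 into contiguous (start, end) strips.

    Strip sizes differ by at most one row; end is exclusive.
    """
    base, extra = divmod(height, workers)
    strips = []
    start = 0
    for i in range(workers):
        # The first `extra` workers take one additional row each.
        size = base + (1 if i < extra else 0)
        strips.append((start, start + size))
        start += size
    return strips


if __name__ == "__main__":
    # A 480-row image shared by four nodes.
    print(split_rows(480, 4))
```

Equal-sized strips balance the work only if every row costs about the same to render; rows that show distant terrain may cost more or less than rows that show nearby terrain, so a real scheduler might need finer-grained or dynamic assignment.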

A network of computers working together is known as a distributed system. “A 
distributed system is an interconnection of one or more processing 
nodes (a system resource that has both computational and storage capabilities), and zero 
or more storage nodes (a system resource that has only storage capabilities, with the 
storage addressable by one or more processing nodes).” (ISO/IEC 8652, 1995) 
Distributed systems differ from centralized systems in that centralized systems consist of 
a single CPU, its memory, peripherals, and some terminals. 

“A distributed program comprises one or more partitions that execute 
independently (except when they communicate) in a distributed system.” (ISO/IEC 8652) 
If we accept Booch’s definition of an embedded system as it applies to multiple 
networked CPUs working together concurrently (Booch, 1987), can this technique 
provide benefits to a PC based perspective view generator? 

Andrew Tanenbaum divides Flynn's Multiple Instruction Stream, Multiple Data 
Stream model (Flynn, 1992) into two subcategories: multiprocessors that share memory 
and multi-computers having private memory (Tanenbaum, 1992). We will limit our 
discussion to a bus-based multi-computer comprised of two or more workstations 
connected to a Local Area Network (LAN) using Fast Ethernet. 


1. The Central Processing Unit (CPU) 


The decision to build applications on networked clusters requires an 
understanding of the performance improvement expected from single processors. The 
time and effort it takes to restructure an application using a parallel design is often futile 
as single node performance improves faster than the development effort expended to 
achieve modest gains. This section provides insight into current processor technology 
that guided our purchasing decisions. 

The Pentium Pro is Intel’s latest and most powerful processor. While currently 
shipping versions are clocked at 200 megahertz, Intel has demonstrated a 300-megahertz 
Pentium. Tests show that the 300 megahertz chip delivers 50% better performance than 
Intel's top Pentium Pro. The new chip won't be available for home PCs until late summer 
1997. Sources at the International Solid State Circuits Conference also report that Intel's 
processors have reached clock speeds of 433 megahertz and 451 megahertz in the 
laboratory. The 233-megahertz and 266-megahertz Pentium MMXs will reportedly be 
out by early summer 1997 as well. 

Digital Equipment’s 500 MHz 21164 is currently the fastest CPU installed in 
personal computers. The recent growth of Microsoft Windows NT as the software 
of choice for server platforms has hindered the Alpha’s acceptance. Although Microsoft 
has continued to provide a native version of NT for the Alpha platform, software 
developers have focused development effort on Intel specific applications. 

Digital has responded recently by releasing FX!32, an Intel Windows emulator 
capable of directly executing Intel binaries. Unfortunately, as with most emulators, it 
does not provide the performance the 500 MHz 21164 is capable of. Furthermore, it is not 
100% compatible with all Intel based Windows binaries. 

The 21164 is twice as fast as Intel’s Pentium Pro 200 when comparing integer 
operations, and nearly four times faster when comparing floating-point operations. 
Figure 3 compares integer and floating point performance of the Alpha and Pentium Pro 
processors using the SPEC95 benchmark suite discussed in the next section. 

[Chart: SPECint95 and SPECfp95 results for the Alpha 21164 and the Intel Pentium Pro 200] 

Figure 3. SPEC Benchmark Comparison of Alpha and Intel CPUs 


Measuring processor performance can be controversial. There are many 
benchmark software products useful for measuring specific capabilities of a system, but 
reliance on any one suite can mislead potential system buyers. Identifying how 
benchmark results relate to a user’s computing needs is a difficult but essential part of the 
system acquisition process. Often manufacturers will design their own benchmark 
programs. The results are then published as marketing tools for touting their products. 
These numbers may have little relevance to overall system performance in actual 
applications. 


2. Intel Comparative Microprocessor Performance Benchmark 
(iCOMP) 


Intel publishes its own iCOMP (Intel Comparative Microprocessor Performance) 
index. They are forthright in admitting that their only intent is “to help end users decide 
which Intel microprocessor best meets their desktop computing needs.” The iCOMP 
Index 2.0 rating is based on integer, floating point and multimedia capabilities. 
Figure 4 demonstrates the near linear increase in performance in Intel’s Pentium 
processor line using the iCOMP index. 

[Chart: iCOMP Index 2.0, which compares the relative performance of different Intel 
microprocessors; axes: processors versus speed rating] 

Figure 4. Intel’s iCOMP Index 





3. The Standard Performance Evaluation Corporation (SPEC) 


A more neutral benchmark suite may be appropriate when comparing general 
performance between one system and another. SPEC is a non-profit corporation formed 
to “establish, maintain and endorse a standardized set of relevant benchmarks that can be 
applied to the newest generation of high-performance computers” (from SPEC's bylaws). 
The founders of this organization believe that the user community will benefit greatly 
from an objective series of applications-oriented tests designed to serve as common 
reference points and be considered during the evaluation process. SPEC publishes 
vendor benchmark results on its World Wide Web home page at 
http://www.specbench.org (SPEC, 1997). 

SPEC95 is the current suite of benchmarks published by SPEC. These 
benchmarks measure the performance of the processor, memory system, and compiler 
code generation. UNIX is normally used as the portability vehicle, but the benchmark 
programs have been ported to other operating systems as well. The percentage of time 
spent in operating system and I/O functions is generally negligible. SPEC95 CPU 
benchmarks are internally composed of two collections: CINT95 and CFP95. 
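
As background not spelled out in the text above, SPEC's composite figures are, to the best of our understanding, formed by dividing a reference machine's run time by the measured run time for each program in a collection and then taking the geometric mean of those ratios. The Python sketch below illustrates that calculation; the function names are hypothetical and no real SPEC timings are used.

```python
import math


def spec_ratio(reference_seconds, measured_seconds):
    """Ratio for one program: larger means faster than the reference."""
    return reference_seconds / measured_seconds


def composite_score(ratios):
    """Geometric mean of the per-program ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Made-up timings for two programs, purely to show the arithmetic.
    ratios = [spec_ratio(100.0, 50.0), spec_ratio(400.0, 50.0)]
    print(composite_score(ratios))
```

A geometric mean keeps one unusually fast program from dominating the composite the way it would in an arithmetic mean, which is one reason single-number scores still need to be read alongside the per-program results.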


a. CINT95 


CINT95 is a suite of integer based programs representing the CPU-intensive 
part of system or commercial application programs. Figure 5 illustrates the 
relative integer performance of the Intel Pentium series processors (SPEC, 1997). Notice 
that the performance gains follow the clock rating of the chip itself. As with Intel’s 
iCOMP rating, increases in performance since the introduction of the Pentium 100 have 
been nearly linear. However, the recent introduction of the Pentium MMX series has 
pushed performance of the 166 megahertz Pentium MMX beyond that of the Pentium 
200. 

Figure 5. Integer Performance of the Pentium Series 

Intel’s top performer, the 200 megahertz Pentium Pro, is still the champion of the 
average desktop system, but Figure 6 shows that there is no substitute for higher clocked 
processors to boost performance (SPEC, 1997). The increase in performance is not linear 
based on clock speed alone. Other factors play a role in overall performance, including 
on-board cache size, RISC versus CISC architecture, bus design and high-speed peripherals. 

Figure 6. Integer Performance of Various CPUs 


b. CFP95 


Likewise, CFP95 measures the CPU-intensive part of numeric-scientific 
application programs. When compared with other processors, Intel lags well behind the 
competition in floating point performance. As stated previously, the Pentium Pro is 
eclipsed by Digital’s 500 megahertz 21164. When examining the benchmarks presented 
in Figure 7 (SPEC, 1997), it is obvious that performance is sacrificed for the sake of 
cost. Further research should be conducted to determine if the faster Alpha processors 
see relative increases throughout the architecture. We have seen bottlenecks that appear 
to be bus related. Moving to a faster CPU may not reap the performance benefits 
indicated by the benchmarks. 

Figure 7. Floating Point Performance Comparison of Various Processors 

the “Rate” suite. SPECint_rate measures the integer capacity of a system while 
SPECfp_rate measures the floating-point capacity (SPEC, 1997). 


d. SPECint_rate95 


Sometimes adding an extra processor can make up for a lack of 
computational power. Figure 8 demonstrates that characteristic (SPEC, 1997). In fact, a 
dual Pentium Pro 166 machine is about as fast as the envied 500 megahertz Alpha 21164. 

Figure 8. SPEC’s IntRate95 Demonstrates Processor Scalability 



III. PERFORMANCE ANALYSIS 


Our approach has been to define critical functions, write small benchmark 
programs designed to test these functions, and finally compare the execution times 
between workstations upon which the PEGASUS software is currently implemented with 


those achieved in the PC configuration. 


The three main areas selected for testing are: 


• Communications: How fast can messages and data arrays be sent from one 
processor to another? 

• Image display: How fast can an image in main memory be transferred to the 
display? 

• CPU and main memory: How fast can the inner ray-trace loop and terrain 
arrays be accessed? 


In addition to these questions, there are disk performance issues requiring the 
quantification of the wide 16 bit SCSI drives, the impact of swap space, the effect of main 
memory size, and a host of other issues that can also affect performance. We have 
concentrated on the three areas mentioned above because in the past personal computers 
have not performed well enough in these areas to be used instead of workstations. Until 
these performance hurdles had been overcome, further detailed investigation and 
optimization was pointless. 


A. COMMUNICATION TESTS 


At the time of writing, four 3COM 100 megabit per second communications cards have 
been installed under Linux. Several communication tests were performed and are outlined 
in this section. 

Figure 9 shows comparative benchmarks published in trade journal literature for 
100BT-card operation. According to these numbers, close to a megabyte per second per 
link can be expected from a 5-client cluster using Novell’s NetWare 4.02 on an HP 
NetServer 4/100 LC (LAN Times Magazine, 1995). A single client transfer rate ranges 
from 2.5 to 7Mbps. This is a long way from 100Mbits/s raw hardware speed. 


[Bar chart: megabyte per second throughput for “five clients one server” and “one client one server”] 

Figure 9. EtherExpress 10/100 Fast LAN Performance 


Our research indicates these numbers are overoptimistic. Considerably slower 
speeds can be expected in a reliable contention-less domain. Fast Ethernet 
in our test systems, which use a small number of processors, achieved performance on par 
with the 1.2 megabyte-per-second communication speeds achieved using the current 
Transputer Link real-time perspective view generators. This speed is not considered 
adequate for smooth parallel frame rate operation. However, it is adequate to support 
several frames per second and provide playback speeds comparable to those achieved on 
more expensive equipment. 

The following tests were conducted to explore performance gained using other 
communications methods: 


1. FTP File Transfer Test 

Before designing any low level benchmark programs to test potential network 
performance, we decided to use available utilities to establish rough performance figures 

2. Local File Copy Test 

To test the speed at which data could be moved locally from one region of a disk to 
another, the following script was executed: 

Figure 12 indicates the performance achieved from executing the previous script. 
These numbers were also repeatable and did not change even if copies were made from 
the SCSI to the IDE drive or from SCSI to SCSI. Approximate results are quite 
unambiguous. Hence the communication is between 1000K and 4100K bytes/sec with 
the average being 2400K per second. The local transfer rate is up to ten times that of the 
10BASE-T capacity and as much as twice as fast as 100BT network transfers. 

Figure 12. Linux Local Disk Copy Transfer Rate Samples 

3. X-Server To Remote Machine 

An X-server program viewtest was written to test the speed with which an X-server 
could display images in this configuration. The program sets up two image buffers 
and sequentially calls the server with the subroutine outlined below: 

XPutImage(XtDisplay(window->w), XtWindow(window->w), window->gc, 
          window->image, 0, 0, 0, 0, w, h); 





This program was run using a server on a local host baerpc and a server on a 
remote host wolfpc. A session dialogue follows: 

baerpc:~/BENCH_MARK/PIXELS$ viewtest -display wolfpc:0.0 
Generating pattern 1... 
Generating pattern 2... 
PRESS RETURN TO START BENCH MARK... 
Total time took 3 seconds to display 100 frames 
Average frame rate was 33 per second 

NOTE: 65536 bytes/frame * 33 frames/sec = 2.16 Mbytes/sec 

baerpc:~/BENCH_MARK/PIXELS$ viewtest 
Generating pattern 1... 
Generating pattern 2... 
PRESS RETURN TO START BENCH MARK... 
Total time took 1 seconds to display 100 frames 
Average frame rate was 100 per second 

NOTE: 65536 bytes/frame * 100 frames/sec = 6.55 Mbytes/sec 

ANALYSIS: X-server takes 1/6.55 = .1526 sec/Mbytes to display on local host 
X-server takes 1/2.16 = .4629 sec/Mbytes to display remotely 
The difference: .3103 sec/Mbytes attributed to Comm delay 

RESULT: steady state transfer rate is 3.222 Mbytes/sec 
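
The arithmetic behind the ANALYSIS and RESULT lines can be sketched in C as follows. This is only an illustration of the calculation; the function names are hypothetical and do not come from the benchmark code, and one Mbyte is taken as 1,000,000 bytes to match the NOTE lines.

```c
/* Throughput in Mbytes/sec for frames of a given size shown at a given
   rate (1 Mbyte = 1,000,000 bytes, as in the session notes). */
double display_mbytes_per_sec(double bytes_per_frame, double frames_per_sec)
{
    return bytes_per_frame * frames_per_sec / 1.0e6;
}

/* Link rate implied by the extra time per Mbyte that the remote display
   needs compared with the local display. Both arguments are in Mbytes/sec. */
double link_rate_from_delay(double local_rate, double remote_rate)
{
    double delay = 1.0 / remote_rate - 1.0 / local_rate;  /* sec per Mbyte */
    return 1.0 / delay;
}
```

Calling link_rate_from_delay(6.55, 2.16) reproduces the steady state figure of roughly 3.22 Mbytes/sec quoted in the RESULT line.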


This result is quite fast, repeatable to a factor of 2 and of unknown reliability. The 
window observed would flip by quickly. Consequently, there was no chance to verify the 
detailed accuracy of the image displayed. The issue of reliability is discussed in the 
following section. Raw transfer rates of 3Mbytes/sec represent approximately 50% of the 
bare hardware speed and may easily be close to the maximum throughput achievable for 
steady state operation. 


4. Socket Transfer Test 

In order to get further speed tests and become familiar with the primitive program 
APIs that are available, a socket based client-server pair of programs was written to test 
program to program interconnectivity and transfer rates. Code details are contained under 
the BENCH_MARK/COMM_TEST directory, which stores the programs written for 
testing communication rates between machines. 

Two programs are of special interest since they serve as the prototype for both 
TCP/IP STREAM socket programming and the communication speed tests conducted. 
These two programs are called client_variable.c and server_variable.c and are listed in 
Appendix B. 

Figure 13 shows a block diagram of the standard UNIX call sequence on both the 
client and server side required to implement this program to program interface. The 
program opens a socket between the two machines. A header is sent by 
client_variable to specify the size of the data blocks about to be sent. The client then 
sends a block. The server reads it and echoes it back. The client reads the echo and in 
turn echoes the received data back to the server. Both programs ping-pong the data back 
and forth for the specified number of blocks. After the loops are completed the client 
checks the data content and prints out the time elapsed and number of errors occurring in 
the overall transmission. The programs must be run in pairs with the server started first on 
one machine and the client started second. The server waits at the “listen” until the client 
tries to connect. 
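
The echo scheme described above can be sketched in a few lines of C. This is not the Appendix B code: it is a minimal illustration that assumes a local socketpair() and fork() in place of two networked machines, and it omits the size header because the child process inherits the parameters. The names write_full, read_full and pingpong are hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write exactly len bytes; write() may accept fewer than requested. */
static int write_full(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0) return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Read exactly len bytes; read() may return fewer than requested. */
static int read_full(int fd, char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = read(fd, buf, len);
        if (n <= 0) return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Ping-pong nblocks blocks of block_size bytes between a client (parent)
   and an echo server (child). Returns the number of corrupted bytes seen
   by the client, or -1 on a setup or I/O failure. */
long pingpong(int nblocks, size_t block_size)
{
    int sv[2];
    long errors = 0;
    pid_t pid;
    char *out = malloc(block_size);
    char *in = malloc(block_size);

    if (!out || !in || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        free(out);
        free(in);
        return -1;
    }
    pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                       /* server: read a block, echo it */
        close(sv[0]);
        for (int i = 0; i < nblocks; i++) {
            if (read_full(sv[1], in, block_size) < 0) _exit(1);
            if (write_full(sv[1], in, block_size) < 0) _exit(1);
        }
        _exit(0);
    }
    close(sv[1]);
    memset(out, 0x5a, block_size);        /* client: known test pattern */
    for (int i = 0; i < nblocks; i++) {
        if (write_full(sv[0], out, block_size) < 0 ||
            read_full(sv[0], in, block_size) < 0) {
            errors = -1;
            break;
        }
        for (size_t k = 0; k < block_size; k++)
            if (in[k] != out[k]) errors++;
    }
    close(sv[0]);
    waitpid(pid, NULL, 0);
    free(out);
    free(in);
    return errors;
}
```

A call such as pingpong(1000, 1000) mirrors the 1000-block, 1000-byte runs reported in this chapter, although a local socket pair says nothing about Ethernet throughput.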


[Block diagram: client_variable.c and server_variable.c call sequences, with the client’s connection request to socket #1053] 

Figure 13. UNIX TCP Calls for Socket Connection 





Communication speed tests involve both raw hardware speeds as well as the setup 
and error checking conducted by the communications (in this case TCP/IP) software 
layers. The overall transfer speeds therefore depend upon the block size being sent since 
communication software overhead is largely independent of the message size. Transfer 
rates were calculated by dividing the total number of bytes transferred back and forth by 
the total amount of real time to do so. 
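
The rate calculation, and the reason it depends on block size, can be sketched as follows. The first function is the division described above. The second is a simple assumed model, not one taken from the test programs: each block pays a fixed software overhead plus its transmission time at the raw link rate. The function names and any overhead value passed in are hypothetical.

```c
/* Measured rate in Mbytes/sec: total bytes moved in both directions
   divided by the elapsed real time. */
double transfer_rate_mbytes(double total_bytes, double elapsed_sec)
{
    return total_bytes / elapsed_sec / 1.0e6;
}

/* Modeled rate in Mbytes/sec when every block pays a fixed software
   overhead (seconds) plus its transmission time at the raw link rate
   (bytes per second). */
double modeled_rate_mbytes(double block_bytes, double overhead_sec,
                           double link_bytes_per_sec)
{
    double time_per_block = overhead_sec + block_bytes / link_bytes_per_sec;
    return block_bytes / time_per_block / 1.0e6;
}
```

With a fixed overhead, larger blocks spread that cost over more bytes, so the modeled rate rises with block size and approaches the link rate only when the overhead is negligible.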

The client server program pair was run as both single and simultaneous parallel 
processes. Configuration of these processes and the network hardware accessed are 
shown in Figure 14. Each server is started listening to a unique port number. The clients 
then attempt to establish a connection to these port numbers. In addition to port numbers, 


the clients specify the network name of the server machine. Hence, as shown in figure 
14, the clients connect through wolfpc, wolfpce2, or wolfpce3 to distinguish between the 
connection hardware desired. 

Figure 14. Process and Network Configuration 


Figure 15 shows the cumulative transfer rates and the percent of idle CPU time as 
a function of the process and connectivity configuration exercised. The processes 
running are specified by the x’s in the columns on the left. For example, the first three 
lines indicate only one process was active on the T10, T100 hub, and T100 direct 
connection lines at a time. The fourth line indicates two processes were active on the 
T100 Hub connection. 

Figure 15. TCP/IP Communication Speed Results 


The results show typical efficiency increases as the block size increases. The 
CPU idle percentage when nothing operated was approximately 95%. During test 
execution this number was approximately the same for all block sizes so only one number 
is shown. The numbers clearly show that software efficiency is the biggest resource 
consumer. 

The throughput did not increase when using two hardware devices compared with 
one (line 4 vs. line 7). The two-process single hardware link provided the most efficient 
data transfer rates. Though a peak performance of 2.5 megabytes per second was achieved 
using 6 processes on 3 hardware links, the extra half megabyte per second cost 
approximately 30% of the CPU time while the cost associated with the first 2 megabytes was 
only 25%. 
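
The CPU cost comparison above can be made explicit by dividing the CPU share by the throughput it buys. The helper below is a hypothetical illustration of that arithmetic only.

```c
/* Percent of CPU time spent per Mbyte/sec of throughput. */
double cpu_percent_per_mbyte(double cpu_percent, double mbytes_per_sec)
{
    return cpu_percent / mbytes_per_sec;
}
```

Using the figures quoted above, the first 2 megabytes per second cost 12.5% of the CPU per megabyte per second, while the last half megabyte per second costs 60%.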

These results represent an application’s program-to-program transfer rate of 
approximately 15% of the maximum hardware efficiency of approximately 8 megabytes 
per second throughput advertised on the 100BT lines. This rate is par for typical Ethernet 
connections. 

Our conclusion from these tests is that substantial performance increases could be 
achieved on this low cost hardware if special driver code were written. However, if we 
stick to generally available TCP/IP Ethernet implementations, throughput in the 
megabyte-per-second range can be expected between point-to-point links. This 
performance compares favorably with the 0.7 megabytes per second achieved on the T800 
transputer links used in the parallel processing high resolution battlefield perspective view 
generator. 

An additional lesson can be derived from the test results. The efficiency of 
multiple hardware connections is not substantial enough to warrant the use of the scarce 
PC slots or available IRQs. A grid configuration such as that supported by the 4-link per 
node transputer systems is not recommended here. Instead, a single or double string 
configuration seems advisable. This result would favor the use of multiple CPU nodes 
such as those available in the Dual and Quad Pentium cards currently on the market to 
handle extremely compute intensive applications. 


5. UDP Communication Test Results 


We switched to the connectionless UDP socket protocol and used the sendto() and 
recvfrom() functions for communicating. The UDP protocol reportedly increases 
throughput by a factor of two at the expense of reliability. 

Figure 16 shows the measurements made. Throughput increased by only a 
modest 20%. We did find the transfer was unreliable, but errors only showed up after 
minutes rather than seconds of continuous operation. 

Figure 16. UDP Communications Speed Results 

Considering the error checking that would be required, there seems to be no major 
advantage to the UDP protocol on these machines. 
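
For readers unfamiliar with the connectionless calls, the following is a minimal sketch of a sendto() and recvfrom() exchange. It is not the test program used for Figure 16: it assumes a single process sending one datagram to itself over the loopback interface, and the function name udp_loopback_ok is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Send one datagram over the loopback interface with sendto() and read it
   back with recvfrom(). Returns 1 if it arrives intact, 0 otherwise. */
int udp_loopback_ok(const char *msg, size_t len)
{
    char reply[1500];
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    struct timeval tv = { 2, 0 };            /* give up after two seconds */
    ssize_t got;
    int ok = 0;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    if (rx < 0 || tx < 0 || len > sizeof(reply)) goto done;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a port */
    if (bind(rx, (struct sockaddr *) &addr, sizeof(addr)) < 0) goto done;
    if (getsockname(rx, (struct sockaddr *) &addr, &alen) < 0) goto done;
    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    /* No connection is set up: every sendto() names its destination. */
    if (sendto(tx, msg, len, 0, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        goto done;
    /* Nothing guarantees delivery, so the receive may time out. */
    got = recvfrom(rx, reply, sizeof(reply), 0, NULL, NULL);
    ok = (got == (ssize_t) len) && memcmp(reply, msg, len) == 0;
done:
    if (rx >= 0) close(rx);
    if (tx >= 0) close(tx);
    return ok;
}
```

The receive timeout stands in for the error checking an application would have to add itself, which is the cost weighed against the throughput gain above.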


6. Additional Communication Test Results 


This section documents additional communication tests conducted to explore 
alternative programming approaches. Most of the results are negative in the sense that 
little or no performance improvement over the previous results was achieved. Hence 


their value lies in the lessons learned. 


a. Compiler Optimization Tests: 


In order to check the effect of compiler optimization we compiled our test 
programs with the maximum O2 compiler optimization switch as follows: 

gcc -O2 -o server_variable server_variable.c 

gcc -O2 -o client_variable client_variable.c 

We then sent and received 1000 blocks of 1000 bytes for a total transfer of 
2,000,000 bytes with 0 errors in 1.470000 seconds at 1.360544 megabytes per second. 
This is comparable to the transfer rates shown in figure 15 when using 1000 byte size 
blocks. 

We conclude that there does not seem to be any major effect from 
compiler optimization. 


b. Maximum Block Size Test: 


Since cumulative performance increases with block size, the question of 
what is the biggest usable block size is of interest. Test result printouts show: 

Sent and received 100 blocks of 1440 bytes with 0 errors. 
Total transfer of 288000 bytes with 0 errors in 0.190000 sec at 
1.515789 mbs. 

Sent and received 100 blocks of 1460 bytes with 0 errors. 
Total transfer of 292000 bytes with 0 errors in 0.200000 sec at 
1.460000 mbs. 

Sent and received 100 blocks of 1520 bytes with 380 errors. 
Total transfer of 304000 bytes with 380 errors in 0.510000 sec at 
0.596078 mbs. 

Note that communication is unreliable or fails when the packet size exceeds 1440 
bytes. This result corresponds to the limit imposed by the Ethernet packet size. We were 
not able to eliminate the effect by increasing the TCP buffer size through software. Our 
conclusion is that STREAM sockets do not in general insulate the user from low-level 
packet size restrictions and that user level blocking at approximately 1400 bytes or less is 
required. 
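
One way to apply user level blocking is sketched below. It is an assumption about how such blocking could be written, not the code used in these tests. The inner loop also reflects a documented property of write(): a single call may accept fewer bytes than requested, so the caller has to retry until the whole piece is sent. The names write_in_blocks and blocks_needed are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes to fd in pieces no larger than max_block, retrying
   because write() may accept fewer bytes than requested. Returns the
   number of pieces issued, or -1 on error. */
long write_in_blocks(int fd, const char *buf, size_t len, size_t max_block)
{
    long pieces = 0;

    if (max_block == 0) return -1;
    while (len > 0) {
        size_t want = len < max_block ? len : max_block;
        size_t done = 0;

        while (done < want) {
            ssize_t n = write(fd, buf + done, want - done);
            if (n <= 0) return -1;
            done += (size_t) n;
        }
        buf += want;
        len -= want;
        pieces++;
    }
    return pieces;
}

/* Number of blocks needed for len bytes at max_block bytes per block. */
size_t blocks_needed(size_t len, size_t max_block)
{
    return (len + max_block - 1) / max_block;
}
```

For example, a 5000-byte message sent with a 1400-byte limit goes out as four pieces.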


7. FDDI Comparison on Other Machines 


To get a comparative performance of the STREAM socket interface on other 
machines we tested the client server pair on the phoenix and ntciris machines at TRAC 


Monterey. The operating system characteristics of these machines are: 


The uname -a command returns SUNOS phoenix 4.1.3 Ul 8 Sun4m. 
The uname -a command returns te @etieCiris sou 1 UO1SI0 TP) mips, 


These two machines are connected with 100 megabit per second FDDI links 
through a Cabletron Hub. On ntciris the FDDI interface card was purchased from 
Silicon Graphics Incorporated. On phoenix an S-bus FDDI 100mbs card from Network 
Peripherals was used. For 1000 back-and-forth cycles the following test results were 
recorded: 

Block size in bytes:          200   600   1000  1400  2000  3000  4000  5000 
Transfer rate in Mbytes/sec:  .15   .44   .68   .89   1.18  1.41  1.68  error 





These rates were 40% slower than the Pentium T100 card systems when 
comparable block sizes are used. However, in the FDDI protocol the basic packet size is 
4500 bytes rather than the 1500 bytes in a conventional Ethernet. Consequently transmission 
errors did not occur until block sizes in the neighborhood of 4000 bytes were sent and the 
data rates were higher at the large block size limit of the FDDI system. 

This experiment demonstrates that large size packets on STREAM sockets do not 
transmit reliably unless the application performs handshake and blocking. It also showed 
that the performance of the Pentium systems was better than the Sun and SGI 
workstations they are expected to replace. To check the effect of the software selectable 
buffer size we executed the following code: 


sendbuff = 65536; 
optlen = sizeof(sendbuff); 
/* ask the socket layer for its current send buffer size */ 
if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, 
               (char *) &sendbuff, 
               &optlen) < 0) perror("getsockopt error"); 
printf("default send buffer size = %d\n", sendbuff); 





The result was default send buffer size = 61440 
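
The same query can be wrapped in a small function for reuse. The sketch below uses the standard getsockopt() call with SO_SNDBUF; the wrapper name send_buffer_size is hypothetical, and the value returned is system dependent, so it will not necessarily match the 61440 reported here.

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Return the send buffer size reported for sock, or -1 on error. */
int send_buffer_size(int sock)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, &len) < 0)
        return -1;
    return size;
}
```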


This limit had no effect on the errors. We note the FTP transfer rate of a 296960-byte 
file across the campus 10mbs link is 233kBytes/s, while the same file transferred between 
the two machines at 160kBytes/s. These numbers are very usage dependent and only 
provide a feeling for typical rates expected from reliable large data transmission codes. 
Clearly the program to program reliable transfer rates appear to be between 10% and 20% 
of the underlying hardware capability. 


8. Reliability Test 


To check reliability we ran several of the tests shown in figure 15 for long periods 
of time. Specifically, the 3 card, 6-process test was run for 16 hours. This represents 
approximately 1.5 terabytes of data with no detected error. Our conclusion is that both 
the hardware and software are stable as long as the block size is below 1400 bytes and 
sufficient handshake is implemented by the socket user to eliminate Ethernet buffer 
overrun. 


9. One Way Test 

Rather than using an echo scheme, we programmed the client server pair so that 
one would only write while the other only read. Our original thought was that this 
would always keep the input buffers full, and we hoped the “advertised” flow control for 
STREAM sockets would allow the internal software to optimize throughput. To our 
surprise this resulted in the same type of errors that occurred with large block sizes. 
Further investigation showed that the client or “write” function calls did not block as 
specified in the literature. Instead, all writes would execute even if the 
corresponding server read had been suspended. 

At this point the only way we could get reliable one way transmission was to 
suspend the write side for long enough to allow the link to recover. This means the one 
way transmission rates are approximately half of the numbers stated in Figure 15. It was 
apparent that the concept of “keeping the hardware pipe full” has not been effective. A 
communication handshake seems necessary. 
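
A simple form of such a handshake is a one-byte acknowledgement after every block, so the writer never runs ahead of the reader. The sketch below is an assumed illustration, not the scheme used in these tests: it runs both sides locally over socketpair() and fork(), and the names xfer and one_way_with_ack are hypothetical.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ACK 0x06

/* Transfer exactly len bytes: writes when do_write is nonzero, else reads. */
static int xfer(int fd, char *buf, size_t len, int do_write)
{
    while (len > 0) {
        ssize_t n = do_write ? write(fd, buf, len) : read(fd, buf, len);
        if (n <= 0) return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* One-way transfer of nblocks blocks with a one-byte acknowledgement per
   block. Returns the number of blocks acknowledged (nblocks on success),
   or -1 if the setup fails. */
long one_way_with_ack(int nblocks, size_t block_size)
{
    static char block[1400];
    char ack = ACK;
    int sv[2];
    long acked = 0;
    pid_t pid;

    if (block_size > sizeof(block)) return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                       /* reader: consume, then ack */
        close(sv[0]);
        for (int i = 0; i < nblocks; i++)
            if (xfer(sv[1], block, block_size, 0) < 0 ||
                xfer(sv[1], &ack, 1, 1) < 0) _exit(1);
        _exit(0);
    }
    close(sv[1]);                         /* writer: send, wait for ack */
    for (int i = 0; i < nblocks; i++) {
        ack = 0;
        if (xfer(sv[0], block, block_size, 1) < 0) break;
        if (xfer(sv[0], &ack, 1, 0) < 0 || ack != ACK) break;
        acked++;
    }
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return acked;
}
```

Waiting for each acknowledgement costs a round trip per block, which is consistent with one-way rates falling well below the echo-test figures.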


10. Message Passing Interface (MPI) and LAM 


The Message Passing Interface (MPI) is a networked parallel-processing standard 
based upon the message passing program paradigm. Several manuals listed in the 
documentation section describe the C code bindings required to implement this standard. 
MPI implements a message scheme for multi-process functions to communicate across a 
network while LAM adds remote host control functions through the use of a supervisory 
daemon. 

The following test results exercised an echo routine similar to the socket 
algorithm described above, only using the MPI_Send() and MPI_Recv() functions. The 
time here is the total time for all the back and forth messages to take place, and the rate is 
in megabytes per second. The total bytes transferred is twice the value shown in the 
1way_#ofbytes column and only one half of the duplex transmission connection is exercised. 

Table 1. MPI Round Trip Transfer Performance 

The tests conducted indicate transfer rates using LAM are similar to the rates 
achieved by the direct socket test programs discussed above. For small block sizes the 
socket programs were faster. However, when block sizes above 1400 bytes were given to 
the MPI functions the resulting transmission rates were about 25% more efficient. 
LAM/MPI uses more CPU resources than the socket code. For the 1400 byte block size, 
the 1.2 megabyte per second rate left only 58% of the CPU idle. The corresponding 
socket code left 77% of the CPU available for other applications while achieving 
1.5mbps transfer rates. 

We also conducted one way transfer tests. In these tests the message blocks were 
sent in one direction and without an echo. The last block was echoed back to allow some 
error checking. The results were as follows: 

rE or Time Rate 
fo [640200 _|200 «| 3201s 2.599309 | 0.246296 |10 [1.8 


fo | 4o200_[ 600 [1067 | 0.417554 [1.533218 [16 [2.1 __ 
.o___} 642000 _[ 1900__|641__|o. 342754 | 1.706 123 
32 
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75 
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19 | 720000 |sooo0 [9 i 0.337875 | 2.130966 [43 | 4a 
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Table 2. MPI One Way Transfer Performance 
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* NOTE: the wall clock time for the 640000 byte transmission was 2 second indicating a 
substantial amount of time in setup. 


The MPI implementation provides reliable one way transmission at essentially the 
full line rates achieved for the socket implementation. For large blocks the MPI provides 
a better communication service than direct socket programming and appears to have 
solved Ethernet packet size limits and one way reliability problems which surfaced in the 
socket implementation. 

Here we also show the %CPU idle for selected tests. These were measured by 
executing the UNIX TOP command repeatedly while sending a continuous stream of 
messages of the indicated block size. The TOP command shows the percentage of CPU 
idle as a snapshot of current activity. This figure typically varies from snapshot to 
snapshot. The numbers shown are averages over several TOP snapshots. The only other 
process consuming CPU resources at the time was TOP itself, which took 2.7% of the 
processor resources. The results clearly indicate that the LAM/MPI code achieves its 
performance at the cost of heavy CPU cycle utilization. 

Results we have obtained for blocks less than 1400 bytes are consistent with the 
expectation that small performance penalties are inherent with the extra service provided. 
Though important for the overall TELLUS development, these results primarily verify the 
validity of the socket implementation discussed in the last chapter. There appear to be 
significantly faster transfer mechanisms implemented inside these more 
sophisticated MPI packages which would speed up the application software level 
communication rates. 

The emergence of MPI as a platform independent standard, together with these 
performance results, indicates that MPI is a preferred implementation API for distributed 
processing applications at this time. 


B. IMAGE DISPLAY TEST 


A simple X Window based routine was written which generates two 256x256 
image buffers in main memory. The program then calls the XPutImage() routine to 
alternately write the images to the window. This tests the time it takes to display a 
perspective view image once the content is calculated. 

Figure 17 shows the basic timing loop at the top. This calls UpdateDisplay() 
which displays the content of the buffer passed to it. 

puts("PRESS RETURN TO START BENCH MARK..."); 
getchar(); 

window = (WINDOW *) 
    DisplayPixels(array[0], 256, 256, colortab, 64, "", 
                  argc, argv, bboard); 

XtAddCallback(window->pd, XtNdestroyCallback, (XtCallbackProc) exit, 0); 

/* time() has one-second resolution, so b is a whole number of seconds */ 
a = time(0); 

/* alternate between the two image buffers for 100 frames */ 
for (i = n = 0; i < 100; i++, n = !n) 
    UpdateDisplay(array[n], window); 

b = time(0) - a; 

printf("Total time took %d seconds to display 100 frames\n" 
       "Average frame rate was %d per second\n", b, (int) (100 / b)); 

exit(0); 

XtRealizeWidget(toplevel); 
XtAppMainLoop(app); 

/* point the XImage at the given pixel buffer and send it to the server */ 
void UpdateDisplay(unsigned char *pixels, WINDOW *window) 
{ 
    register int x, y, w, h; 

    w = window->image->width; 
    h = window->image->height; 
    window->image->data = (char *) pixels; 
    XPutImage(XtDisplay(window->w), XtWindow(window->w), window->gc, 
              window->image, 0, 0, 0, 0, w, h); 
} 





Figure 17. Image Display Benchmark 
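
Because time() counts whole seconds, a run that finishes in about one second is reported with very coarse precision. A finer measurement could be taken with gettimeofday(), as in the sketch below. This is a suggested variation and not part of the benchmark; the helper names are hypothetical.

```c
#include <sys/time.h>

/* Elapsed seconds between two gettimeofday() samples. */
double elapsed_seconds(const struct timeval *start, const struct timeval *end)
{
    return (double) (end->tv_sec - start->tv_sec) +
           (double) (end->tv_usec - start->tv_usec) / 1.0e6;
}

/* Frames per second, or 0.0 when no measurable time has passed. */
double frame_rate(int frames, double seconds)
{
    return seconds > 0.0 ? frames / seconds : 0.0;
}
```

The timing loop would sample gettimeofday() before and after the 100 calls to UpdateDisplay() and pass both samples to elapsed_seconds().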


The following results were noted during execution of the XPutImage() routine. 
As you can see, a Pentium based PC has the potential to provide smooth video at a 
substantial frame rate. Recently released inexpensive video cards promise even faster 
frame rates. 













Pentium 100 with an S3-968 based PCI video card 


Table 3. Frame rate comparisons 






C. CPU AND MEMORY TEST 


A standard ray trace benchmark (Appendix C) was tested on a series of machines. 
The program sets up a 1000x1000-element terrain array with elevations set at 500 meters 
and performs 50,000 ray trace calculations from 1000-meter altitude to 500 in 5 meter 
steps. The total number of inner loop steps is 50 million. The following table was 
published (Baer, 1991) and is duplicated here along with additional tests performed on 
TRAC and NPS machines. 


243 
i860 33 MHz 
93 

Table 4. CPU and Memory Benchmark Comparison 


Based on these benchmarks, the only processor faster than the Pentium-90 was the 
i860. These tests on the i860 were conducted at Intel after a similar test conducted by 
IDC showed performance numbers of more than 80 seconds. We know Intel engineers 
ran the code exactly as defined but were not able to get other details (i.e., memory speed, 
etc.) to fully understand the very high speed this test produced. 

The Pentium PCs under Linux using the free GNU g++ compiler are as fast or 
faster than any of the workstations on which the PEGASUS battlefield simulation has 
been run. We should note that although the SGI (ntciris) had four processors, the 
program only executed on one of them. The dual-processor Pentium Pro 200-megahertz 
machine is extremely fast for our application. 


1. Effect of Memory Speed and Cache 


The effect of CPU speed does not seem to be the determining factor in the 


performance result of the ray trace benchmark. 
The statement 
e += de; 
increments the ray tip position across columns. When run on the Pentium 90, 
execution time was reduced to an amazing 4 seconds. This is identical to the time the 
program runs if all main memory access is eliminated by taking out the array reference 
statement: 

zt = terrain[uwe][uwn]; 

Hence we conclude that 90% of the execution time is spent on main memory 
access. Increasing the size of the cache or purchasing faster memory would be a more 
effective speed-up than simply getting more CPU cycles by buying machines with higher 
clock speeds. 

The effect of additional local DRAM memory hits was explored by adjusting the 
benchmark routine so the terrain array was organized in blocks instead of rasters. This 
was accomplished by replacing the bit shift statements above the terrain reference with the 
following routine: 

e1 = (e>>16) & 0xf; 
n1 = (n>>12) & 0xf0; 
uwl = n1 | e1; 

e2 = (e>>20) & 0xf; 
n2 = (n>>16) & 0xf0; 
e3 = (e>>16) & 0x300; 
n3 = (n>>14) & 0xc00; 
uwh = n3 | e3 | n2 | e2; 
zt = terrain[uwh][uwl]; 


Pla bewCr Seer): 





Figure 18. Extract Upper 16 Bit Coordinates 


This code runs in 20 seconds with the modification--twice as fast as the raster 
access code in the standard benchmark. This example demonstrates how performance 
can be dramatically improved by simply replacing the two lines of code executing two 
shift operations with eight lines of code executing 17 bit operations. This information is 
of great importance both in guiding the writing of fast code as well as making purchasing 
decisions. We speculate that the Pentium Pro design, which includes a 256kb cache on 
each processor in the dual configuration, is fast because of this cache effect, not the clock 


speed. 
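
The block addressing idea can be sketched as a function that maps a coordinate pair to a flat array offset. Several constants in the Figure 18 listing are hard to read, so the shifts and masks below are assumptions chosen to be self-consistent: coordinates are taken as 16.16 fixed point with 10-bit integer parts, and the terrain is treated as 4096 tiles of 16x16 elements. The function names and the 1024-element raster used for comparison are hypothetical.

```c
#include <stdint.h>

/* Map fixed-point east/north coordinates to a flat offset into a terrain
   array stored as 16x16 tiles (4096 tiles of 256 elements each). */
uint32_t block_offset(uint32_t e, uint32_t n)
{
    uint32_t e1 = (e >> 16) & 0x00f;   /* column bits 0-3 -> bits 0-3   */
    uint32_t n1 = (n >> 12) & 0x0f0;   /* row bits 0-3    -> bits 4-7   */
    uint32_t uwl = n1 | e1;            /* position inside one tile      */

    uint32_t e2 = (e >> 20) & 0x00f;   /* column bits 4-7 -> bits 0-3   */
    uint32_t n2 = (n >> 16) & 0x0f0;   /* row bits 4-7    -> bits 4-7   */
    uint32_t e3 = (e >> 16) & 0x300;   /* column bits 8-9 -> bits 8-9   */
    uint32_t n3 = (n >> 14) & 0xc00;   /* row bits 8-9    -> bits 10-11 */
    uint32_t uwh = n3 | e3 | n2 | e2;  /* tile number, 0..4095          */

    return uwh * 256u + uwl;
}

/* Raster offset for comparison, indexed as terrain[uwe][uwn]: each step
   in e jumps a full row of 1024 elements. */
uint32_t raster_offset(uint32_t e, uint32_t n)
{
    return ((e >> 16) & 0x3ff) * 1024u + ((n >> 16) & 0x3ff);
}
```

With this layout every element of a 16x16 neighborhood falls within the same 256 consecutive elements, so a ray moving a short distance in either direction tends to stay in memory that is already cached, whereas the raster layout moves 1024 elements for each step in e.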


2. Frame Rate Performance Predictions 


The CPU performance benchmark is indicative of the frame rate achievable from 
a multi-processing system executing the inverse ray trace PVG algorithm of the 
PEGASUS/TELLUS projects. 


Table 5 compares the benchmark results on various machines in which the 


simulation has been implemented and actual frame rates measured. 


Machine Speed # Processors Frame Rate 
In seconds 


1800 25mhz [0 
T800 and Power PC 601 40 (ex 
SGI PowerSeries 


SUN (phoenix) 
Pentium 100 underLimux [40.—~<di1——S~=~—~S*d 
Dual Pentium 133 SMP Linax [20 [106 


Dual Pentium Pro 200 under Leos 30 
Windows NT 


Table 5. Benchmark and PVG Frame Rate Comparison 





The relationship between the system speed for the benchmark and the frame rate 
achieved in an implementation is fairly predictable at one frame-per-second for 20 
seconds of system benchmark performance. This would indicate a dual Pentium Pro 
could theoretically be able to generate at least thirty frames per second. However, video 
display and disk access rates will limit the actual frame rate to approximately 15 frames 
per second or less. 

Our findings indicate that the TELLUS PVG will be able to generate about 10 
frames per second for a 256x256 pixel-sized frame. At this rate, sufficient CPU capacity 
should be available to include several high-resolution targets. The bottleneck will be disk 


access and video display speed. 





IV. SUMMARY, RECOMMENDATIONS, AND CONCLUSIONS 


A. SUMMARY 

The personal computer is now powerful enough to handle computationally 
challenging tasks associated with perspective view generation. When properly 
configured with high-speed input/output devices, performance rivals that of the UNIX 
workstations considered “state-of-the-art” just two years ago. Additional performance 
can be leveraged from systems using a parallel approach that includes multiple processors 
on a single machine. However, Fast Ethernet performance still does not provide enough 
bandwidth to transfer uncompressed video over the network at a usable frame rate. 

When cost is a factor, system architecture built around Intel’s Pentium processor 
provides the best value. The Pentium’s lackluster floating-point performance may 
hinder computationally intensive applications and justify a move to a more expensive 
Alpha based platform. 

Many communications libraries exist to ease development of message passing 
strategies. TCP/IP sockets and streams remain the most portable and well documented of 
the libraries we examined. The Message Passing Interface, however, is an emerging 
standard that provides robust communications interfaces and reliable cross-platform 
communications. MPI is simple to use and eases communications intensive software 


development. 


B. CONCLUSIONS 


Our research concludes that an Intel Pentium equipped personal computer is a 
worthy platform for applications such as the PEGASUS Perspective View 
Generator. The current Pentium Pro provides ample floating-point and integer 
performance to deliver POV ray trace calculations at reasonable speeds. The Linux 
operating system, coupled with an X-Windows Motif port, offers an affordable 
alternative to proprietary and often expensive commercial versions of UNIX. Because 
Random Access Memory (RAM) performance continues to lag behind CPU performance, 
bottlenecks exist between the CPU, the memory bus, virtual memory and the hardware 
bus. Even with the advent of 100BT Ethernet cards, any architecture that attempts to 
share memory over the network using message-passing protocols adds another choke 
point to the equation. 


C. RECOMMENDATIONS 

A distributed approach for further development effort on the PEGASUS 
Perspective View Generator is not recommended due to bandwidth constraints of Fast 
Ethernet. A parallel approach using local memory on a multi-processor platform offers 
greater promise of performance improvements related to real-time terrain modeling 
simulations. Further exploration of the availability and accessibility of hardware 
solutions designed to improve upon the limitations of the 33-megahertz PCI Bus speed is 
warranted. 


D. ANSWERS TO RESEARCH QUESTIONS 


To what extent can the PCI bus and new high performance peripherals 
deliver the performance necessary for battlefield simulation application? 
Throughput on 100BT Fast Ethernet peaks at 3mb per second with average throughput 
holding at just under two megabytes per second depending on the packet size. 
Theoretical advertised transmission rates were not achievable. Bus latency appears to 
limit performance, but more research must be conducted to identify the limiting factor. 
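
The gap between nominal and measured rates can be made concrete with a little arithmetic. The following C sketch is illustrative only: it assumes the "3mb" peak figure above means 3 megabytes per second, uses decimal megabytes, and ignores protocol overhead. The function names are hypothetical.

```c
#include <stdio.h>

/* Convert a nominal link rate in megabits per second to megabytes per
 * second (decimal units, 8 bits per byte, protocol overhead ignored). */
double mbit_to_mbyte_per_sec(double mbit_per_sec)
{
    return mbit_per_sec / 8.0;
}

/* Fraction of the nominal link rate represented by a measured throughput. */
double link_utilization(double measured_mbyte_per_sec,
                        double nominal_mbit_per_sec)
{
    return measured_mbyte_per_sec /
           mbit_to_mbyte_per_sec(nominal_mbit_per_sec);
}

/* Print utilization for the peak and average figures quoted in the text.
 * The 3.0 value assumes "3mb" means 3 megabytes per second. */
void print_utilization(void)
{
    printf("peak:    %.0f%% of nominal\n",
           100.0 * link_utilization(3.0, 100.0));
    printf("average: %.0f%% of nominal\n",
           100.0 * link_utilization(2.0, 100.0));
}
```

Under these assumptions a 100-megabit link has a nominal rate of 12.5 megabytes per second, so the measured peak and average correspond to roughly 24 and 16 percent of nominal.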

What other performance considerations affect the suitability of the Personal
Computer as a prospective platform for high-speed terrain modeling? Ray trace
calculations continue to be computationally intensive and a major limiting factor in overall
frame rate performance. Processors are getting faster, and Intel's Pentium Pro
processor is already capable of exceeding the number of calculations per second
achieved on the UNIX workstation considered state-of-the-art just two years ago. Recent
advances in personal computer video card technology offer the promise of even faster frame
rates to enhance Perspective View Generation realism.

Can a distributed approach enhance performance of the Perspective View 
Generator? A distributed approach does not appear to be worth pursuing at this time. 
Parallel workload distribution is suitable when the problem can be divided into pieces 
that require lots of computational power but consume just a small amount of the network 
bandwidth. Although ray-tracing calculations are easily divisible by terrain or display
regions, the volume of resulting data that must be transmitted back to the machine


displaying the view will not support adequate frame rates. 
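
To illustrate why the returned data volume limits frame rate, the sketch below estimates an upper bound on frames per second when every finished frame must cross the network uncompressed. The 640 by 480 display with one byte per pixel is a hypothetical example, not a figure from our tests; the two megabytes per second is the approximate average throughput reported above, and the function names are ours for illustration.

```c
/* Bytes needed for one uncompressed frame of the given dimensions. */
long frame_bytes(long width, long height, long bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Upper bound on frames per second if each frame must be sent over a
 * link that sustains throughput_bytes_per_sec. Computation time and
 * protocol overhead are ignored, so real rates would be lower. */
double max_frames_per_sec(double throughput_bytes_per_sec,
                          long bytes_per_frame)
{
    return throughput_bytes_per_sec / (double) bytes_per_frame;
}
```

With these assumed values a frame occupies frame_bytes(640, 480, 1) = 307,200 bytes, and max_frames_per_sec(2000000.0, 307200) gives about 6.5 frames per second before any ray-trace computation time is counted.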


E. RECOMMENDATIONS FOR FURTHER STUDY 


Linux served as an attractive operating system for the TELLUS project in that
porting problems from UNIX were minor since Motif served as the common interface.
However, most commercial hardware vendors do not provide Linux drivers for their equipment,
leaving researchers to either write drivers on their own or wait on the Linux programming
community to support the equipment. Windows NT 4.0 appears to be well supported
since new hardware is seldom released without drivers for both Windows NT and
Windows 95. Both Windows NT and Windows 95 support OpenGL version 1.1.

Since OpenGL has now been ported to the PC, the Perspective View Generator
may benefit from its now portable API. Initial tests of low-cost Glint-based video cards
show as much as a tenfold performance gain over previous 2D cards. Exploring both
OpenGL and Microsoft's DirectX API may be valuable since much of the
computationally intensive work of generating the view can be moved from the host
CPU to the graphics hardware. Under OpenGL, ray-tracing algorithms may not be
necessary since the OpenGL engine can handle perspective view generation. A critical
question is whether the new PC-based 3D cards can handle the demands of real-time
terrain modeling.





APPENDIX A. LINUX LAM/MPI INSTALLATION GUIDE 


A. WHAT IS LAM? 


LAM (Local Area Multicomputer) is an MPI (Message Passing Interface) 
programming environment and development system for heterogeneous computers on a 
network. With LAM, a dedicated cluster or an existing network computing infrastructure 
can act as one parallel computer solving one problem. This paper will discuss some of
the current issues affecting parallel computing in general. More specifically, it will
present the LAM/MPI implementation as a cost-effective way to increase computing
performance simply by adding low-cost machines to a Local Area Network.

LAM is a product of the Ohio Supercomputer Center in Columbus, Ohio and is
implemented on the following platforms:

Sun (SunOS 4.1.3, Solaris 5.4)
SGI (IRIX 5.3, 6.1)
IBM RS/6000 (AIX V3R2)
DEC Alpha (OSF/1 V3.2)
HP PA-RISC (HP-UX 10.01)
Intel X86 (LINUX v1.3.20)


B. WHY USE LAM/MPI? 


Since MPI is nothing more than an API for communication between machines, why use
LAM instead of TCP/IP sockets? While one could certainly achieve the same goal using
sockets, the amount of programming involved is monumental. MPI at its simplest level
involves calls to six basic functions, including the send call on a broadcasting machine and
a receive call on the receiving machine. The length and structure of the data being sent
can be formatted in user-definable types of any size. Socket programming, on the other
hand, requires you to keep track of individual packets by limiting the size of a packet to
around 1500 bytes. Transmitting packets larger than 1500 bytes requires that some negotiation
occur to ensure error-free transmission of the data. During testing of LAM, we
consistently sent 640,000 bytes in a single send/receive pair with no errors over a 100- 


megabit per second LAN. 
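
Part of the bookkeeping that stream sockets leave to the programmer is that a single read() call may return fewer bytes than requested, so a large message has to be collected in a loop. The sketch below shows one common way to do this in C. The helper name read_full is hypothetical and is not part of LAM, MPI or the benchmark code in Appendix B.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes from a stream socket (or any descriptor).
 * Returns len on success, a smaller count if the peer closed the
 * connection early, or -1 on a read error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n == 0)
            break;              /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t) n;     /* partial read: keep collecting */
    }
    return (ssize_t) done;
}
```

A receiver expecting a 640,000-byte message would call read_full(sock, buffer, 640000) and treat any return value other than 640000 as a failed or truncated transfer. With MPI, this loop is hidden inside the receive call.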


C. GETTING STARTED WITH LAM UNDER LINUX


The installation of LAM under Linux can be completed by following the instructions
outlined below:

• Obtain and install the source code for LAM/MPI.
• Modify the HOME environment variable to point to the desired destination of the
  LAM binaries.
• Establish a symbolic link from the appropriate operating system stub file in the
  Config directory to a file called config.
• Establish accounts on each machine in your cluster, giving each equivalent rlogin
  privileges.
• Establish a LAM-based hosts file that lists the names of all machines that will
  participate in the group/cluster.
• Test your network by running the LAM recon command.
• Start LAM running on each machine with the lamboot command.
• Write, compile and execute a simple program under the LAM environment!


D. OBTAINING LAM 


LAM is available in source code format via FTP from ftp://ftp.osc.edu/pub/lam.
Along with the single file containing the baseline source code, called lam60.tar.gz, there
is an additional file containing 17 patches that have to be applied: lam60-patch.tar.gz.


E. INSTALLING LAM


The files were archived using the UNIX tar program and compressed using gzip
to save space. I chose the /usr/src directory as a place to extract the source files in the
following manner:

$ gunzip lam60.tar.gz

The resulting file, lam60.tar, must then be extracted as follows using the tar
command:

$ tar xvf lam60.tar

The files will be extracted into a directory called lam60.

Once the files have been extracted into the new directory, copy the file
lam60-patch.tar.gz into the lam60 directory.


F. APPLYING THE PATCHES 


Likewise, extract the patch file as follows:

$ gunzip lam60-patch.tar.gz

The resulting file, lam60-patch.tar, must then be extracted as follows using the tar
command:

$ tar xvf lam60-patch.tar

When this file is extracted, you should have 17 patch files to apply, ranging in
name from lam60-patch01 to lam60-patch17. Each patch must be applied separately,
and in order. For example, you can't install patch 5 before patch 4. To install the first
patch, make sure that you are in the lam60 directory and execute the following command:

$ patch -p0 < lam60-patch01





If all goes well on the first patch, you should see the following message: 


Patching file share/mpi/MPI.c using Plan A... 
Hunk #1 succeeded at 4. 
Hunk #2 succeeded at 217. 


Done. 





Apply the other 16 patches in the same manner. 


G. CONFIGURING LAM FOR INSTALLATION 


The only instruction document for installing LAM (other than the document you
are reading now) is located in the /lam60/doc directory. Unfortunately, the file is in
HTML format, so if you attempt to view it as plain text, you'll have to deal with the HTML
formatting characters. The actual file name for this help file is lam-install.html.

The /lam60/Config directory contains copies of all the stub files necessary to get
LAM to compile with the appropriate operating system. Change to the /lam60/Config
directory. By default, the config file points to the stub file for the Sun
operating system. Remove the existing config file (which is just a symbolic link) by
typing:

$ rm -rf config

Now it is safe to establish a symbolic link to the Linux stub file as follows:

$ ln -s config.i386_linux config

Once the symbolic link file config has been created, open it for editing. You'll
notice near the top of the file there is an entry referring to the environment variable
HOME.


The line should look something like this: 


HOME= /tmp/lam 


This entry tells the LAM compilation process where to put the executable binaries
once the files have compiled. Edit this entry to name the directory where you want
the files placed.


H. COMPILING LAM 


Build LAM by changing to the lam60 directory and typing make. To run make in
the background, execute the following command:

$ make >& LOG.TXT &

This forces Linux to run make as a background process and capture all output to a
file called LOG.TXT.

When making the executable in this manner, you will get no visible feedback that
the compile is actually working. I suggest you monitor its progress with "top". When the
compilation is complete, examine the LOG.TXT file to be sure that no errors have
occurred. The actual compile process can be surprisingly slow (10 minutes on my 100
MHz machine and 6 minutes on my 133 MHz machine) as there is a lot of code to
compile (120+ object files are created). That's almost as much time as it takes to compile a
kernel!


I. POTENTIAL PROBLEMS WHEN COMPILING


We recommend installing LAM with the latest version of the official release of
Linux, which currently is version 2.0.20. However, if you happen to have an earlier
version of the GNU C compiler and an earlier version of Linux, you may encounter the
following error during the compilation process:

In file included from ../../../share/mpi/rpi.c2c.c:33:
/usr/include/sys/uio.h: redefinition of 'struct iovec'
make: *** [rpi.c2c.o] Error 1





There are two system files that cause this problem. To correct the error you 
should either obtain the latest version of Linux and the GNU C compiler or modify the 
affected files. 

The two files are /usr/include/sys/uio.h and /usr/src/linux/include/linux/uio.h. If
you examine the latter file, you will see a warning hidden away as a comment:

/* A word of warning: Our uio structure will clash with the C library one (which
is now obsolete). Remove the C library one from sys/uio.h if you have a very old library
set. */

We can only guess what defines a "very old library set". Suffice it to say that I
had to manually modify my files until I installed the newest version of the sources. To
correct this problem without going through the trouble of installing a new C library, you
can edit the files as follows:

• Remove the definition for struct iovec in /usr/include/sys/uio.h.
• Include the file "/usr/src/linux/include/linux/uio.h" with the includes at the top
  of the same file.

This replaces the original definition for the structure iovec with the Linux-unique
version of the structure.


J. RUNNING LAM 


LAM requires rlogin access between each of the machines it will be running on.
The easiest and safest way to do this is to create a .rhosts file in the home directory of the
account you've created for LAM on each machine. For security reasons, this file must be
owned and created by the superuser (System Administrator) but should be readable by
all. The format for the .rhosts file is:


[hostname] <user> 


Let's examine a sample .rhosts file that LAM could use. Assume that we have 5
Linux machines in our network named wolfpc, baerpc, pumapc, tigerpc, and owlpc. On
each machine we've established a LAM account. Each account must have the same login
name and password. Each machine must have LAM installed and the accounts must be
able to access the LAM binaries (the lam/bin directory must be in the default path).

The sample .rhosts file for this scenario looks like this:

wolfpc
baerpc
pumapc
tigerpc
owlpc





We can check the file permissions for our sample .rhosts file:

$ ls -al .rhosts
[listing output illegible in the scanned source]

All users from each of these machines are granted access to their accounts without
a password check.


Prior to running LAM you should attempt to rlogin to each of the machines that
will be sharing the LAM network. Log in as the LAM user on one machine and
attempt to rlogin to all the other machines. If you are prompted for a password, either the
permissions are set wrong on the .rhosts file or there is no .rhosts entry for your host on
the other machine.

Another important element to setting up your LAM system is to be sure that the 
LAM binaries (typically installed in the /tmp/lam/bin directory) are in the path of the 
LAM users you have established. 

Another easier, but less secure, way to establish host equivalency for LAM on your
network is to add the host names to the /etc/hosts.equiv file. For some reason you have to
add the fully qualified host.domain-name as well as just the host name to the file. Each
entry must also be on a separate line.


In our example, the hosts.equiv file would look like this: 


# Sample hosts.equiv file
wolfpc.mbaynet.
wolfpc
baerpc.mbaynet.
baerpc
pumapc.mbaynet.
pumapc
tigerpc.mbaynet.com
tigerpc
owlpc.mbaynet.com
owlpc





K. IDENTIFYING THE LAM GROUP PARTICIPANTS


Once you have the binaries installed on each machine and have rlogin
equivalency established, you are ready to test the network. However, you must first create a
file that lists the names of the hosts that are allowed to participate in the LAM network.
A sample file called host.lam was installed in the /tmp/lam/boot directory. The name of
the file is important since it is passed as a command line parameter when you start LAM
running in the background. Simply list the host name of each machine that will
participate.


Now you are ready to give it a try! 


L. THE RECON COMMAND 


To test your configuration, type the following command:

$ recon -v /tmp/lam/boot/host.lam

The recon command attempts to log in to each of the hosts listed in the host.lam
file and tests them to see if the LAM daemon can in fact be executed.

# Sample LAM host file
wolfpc
baerpc
pumapc
tigerpc
owlpc





The most common reasons for an error to occur during recon are:

• No equivalent rlogin privileges for the host specified. If this occurs, check
  the hosts.equiv file to be sure you've included an entry for the fully qualified
  host name as well as the short host name.

• Incorrect path on the tested host. If the LAM recon utility can't execute the
  LAM binaries from the user's home directory it will fail. In this case you
  must modify the csh.login file to be sure you've added the appropriate
  statement to the environment path so LAM can get to the binaries.

If all goes well, your output will look something like this on the originating machine:


testing 


testing 
testing 
testing 


testing 





M. BOOTING LAM 


Once you've successfully executed recon (with no error messages returned), you
are ready to boot LAM. Booting gets the LAM daemon running on each of the machines
participating in the group.

The boot command is executed as follows:

$ lamboot -v /tmp/lam/boot/host.lam

When this command executes, it displays the following message:

LAM 6.0 - Ohio Supercomputer Center

hboot n0 (wolfpc)...
hboot n1 (baerpc)...
hboot n2 (tigerpc)...
hboot n3 (pumapc)...
hboot n4 (owlpc)...
topology done.

When you see the "topology done" message, you know you are in business!


N. COMPILING PROGRAMS USING LAM


If you've made it this far you probably want to get to work writing your
programs. I recommend you at least compile and run the sample program ezstart.c
provided with LAM in the /lam60/examples directory.

To compile the program, execute the following command:

$ hcc -o ezstart ezstart.c -lmpi

Notice that the LAM folks, to make compilation simple, have provided the hcc
command. It compiles and links your code against the LAM libraries without a lot of fuss.


APPENDIX B. TCP/IP SOCKET BENCHMARK PROGRAM


/***************************************************************************
*
* FILENAME: server_variable.c
* PURPOSE:  Test the socket communications to a remote host by receiving
*           any number of blocks of variable length bytes
*           (the length must be sent in the first transmission)
*           and sending a reply
*
***************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

#define PORT_NUMBER 1053
#define MAX_BLOCK_SIZE 65536
#define MAX_NUMBER_OF_BLOCKS 65536
#define MESSAGE_HEADER_SIZE 8
#define VARIABLE_HEADER_TAG 314159

int main(int argc, char *argv[])
{
    int message_header_size;
    int header_body[2];
    int message_body_size;
    int message_body[MAX_BLOCK_SIZE];
    int sock;
    int msgsock;
    struct sockaddr_in server;
    int backlog;
    socklen_t serverlen;
    int bindstat;
    int i = 0, j, comm_error;
    int diagnostic_switch;

    /* check on usage */
    if (argc > 2)
    {
        fprintf(stdout, "Usage > server_variable[diag switch] \n");
        exit(0);
    }

    /* decode diagnostic switch if present */
    if (argc == 2)
    {
        sscanf(argv[1], "%d", &diagnostic_switch);
    } else
    {
        diagnostic_switch = 0;
    }

    /* open server socket */
    sock = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    if (sock < 0)
    {
        perror("opening stream socket");
        exit(1);
    }

    /* Set up parameters for socket connection */
    server.sin_family = AF_INET;
    server.sin_addr.s_addr = INADDR_ANY;
    server.sin_port = htons(PORT_NUMBER);

    /* bind to server socket */
    bindstat = bind(sock, (struct sockaddr *) &server, sizeof server);
    if (bindstat < 0)
    {
        perror("binding stream socket");
        exit(1);
    }

    /* listen to socket */
    backlog = 5;
    /* The backlog parameter defines the maximum
     * length the queue of pending connections
     * may grow to. */
    listen(sock, backlog);

    /* notify operator of server ready to accept socket */
    printf("Server ready to accept socket. ^C to exit.\n");

    /* accept socket connection and get receive socket number */
    serverlen = sizeof(server);
    msgsock = accept(sock, (struct sockaddr *) &server, &serverlen);
    if (msgsock < 0)
    {
        perror("accept");
        exit(2);
    }

    /* receive the header message */
    message_header_size = MESSAGE_HEADER_SIZE;   /* bytes in header */
    if (read(msgsock, header_body, message_header_size) < 0)
    {
        perror("receive message on stream socket");
        exit(1);
    }
    if (header_body[0] != VARIABLE_HEADER_TAG)
    {
        printf("Variable header tag mismatch. Wrong program pair.\n");
        exit(1);
    } else
    {
        message_body_size = header_body[1];
    }

    /* notify operator of ready server */
    printf("Server starting read loop of %d byte blocks. ^C to exit.\n",
           message_body_size);

    while (1)   /* start endless communication loop */
    {
        /* receive the message body */
        if (read(msgsock, message_body, message_body_size) < 0)
        {
            perror("receive message on stream socket");
            exit(1);
        }

        /* diagnostic of communication loop ***************************/
        if (diagnostic_switch)
        {
            i = i + 1;
            comm_error = 0;
            for (j = 0; j < (message_body_size / 4); j++)
            {
                if (message_body[j] != j)
                    comm_error = comm_error + 1;
            }
            printf("block %d size %d error %d \n",
                   i, message_body_size, comm_error);
        }
        if (diagnostic_switch > 0)
        {
            /* empty loop used as a crude delay */
            for (j = 0; j < diagnostic_switch; j++);
        }
        /**************************************************************/

        /* send the block back as the reply */
        if (write(msgsock, message_body, message_body_size) < 0)
        {
            perror("sending message on stream socket");
            exit(1);
        }
    }   /* end while loop */

    /* close the socket */
    if (close(msgsock) < 0)
    {
        perror("Socket close error");
        exit(1);
    }
}








/***************************************************************************
*
* FILENAME: client_variable.c
* PURPOSE:  Test the socket communications to a remote host by sending
*           "n" blocks of "m" bytes each and receiving a reply
*           This routine transmits the number of bytes in the block
*           and must be run with the "server_variable" program
*
***************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <sys/time.h>

#define PORT_NUMBER 1053
#define MAX_BLOCK_SIZE 65536
#define MAX_NUMBER_OF_BLOCKS 100000000
#define MESSAGE_HEADER_SIZE 8
#define VARIABLE_HEADER_TAG 314159

int main(int argc, char *argv[])
{
    int size_of_block;
    int number_of_blocks;
    int message_header_size;
    int header_body[2];
    int message_body_size;
    int message_body[MAX_BLOCK_SIZE];
    int sock;
    struct hostent *remote_host;
    struct sockaddr_in server;
    int connect_stat;
    int i, j;
    int diagnostic_switch;
    int comm_error;
    struct itimerval timevalue, oldtimevalue;
    long timedif_sec, timedif_usec;
    float timedif;

    /* check on usage */
    if ((argc != 4) && (argc != 5))
    {
        fprintf(stdout,
                "Usage > client_variable remote_host number_of_blocks \n");
        fprintf(stdout, "size_in_bytes[diag switch] \n");
        exit(1);
    }

    /* decode diagnostic switch if present */
    if (argc == 5)
    {
        sscanf(argv[4], "%d", &diagnostic_switch);
    } else
    {
        diagnostic_switch = 0;
    }

    /* get block size and number of blocks from command line */
    sscanf(argv[3], "%d", &size_of_block);   /* the block size in bytes */
    sscanf(argv[2], "%d", &number_of_blocks);
    size_of_block = size_of_block / 4;       /* convert to integers */
    if (size_of_block > MAX_BLOCK_SIZE)
    {
        printf("Block size should be less than %d bytes.\n",
               MAX_BLOCK_SIZE * 4);
        exit(1);
    }
    if (number_of_blocks > MAX_NUMBER_OF_BLOCKS)
    {
        printf("Number of blocks should be less than %d.\n",
               MAX_NUMBER_OF_BLOCKS);
        exit(1);
    }

    /* open client socket */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
    {
        perror("opening stream socket");
        exit(1);
    }

    /* Set up parameters for socket connection */
    server.sin_family = AF_INET;

    /* check to see if this is a valid host */
    remote_host = gethostbyname(argv[1]);
    if (remote_host == 0)
    {
        fprintf(stderr, "%s: unknown host \n", argv[1]);
        exit(2);
    }
    bcopy((char *) remote_host->h_addr, (char *) &server.sin_addr,
          remote_host->h_length);
    server.sin_port = htons(PORT_NUMBER);

    /* connect to server socket */
    connect_stat = connect(sock, (struct sockaddr *) &server, sizeof server);
    if (connect_stat < 0)
    {
        perror("connecting stream socket");
        fprintf(stderr, "Possible problems are:\n");
        fprintf(stderr,
                "Did you start the server program on the other machine? \n");
        fprintf(stderr,
                "Big - little endian may be reversed for port number \n");
        fprintf(stderr,
                "Try using [server.sin_port = htons(port_number)].\n");
        exit(1);
    }

    /* build a message content */
    message_body_size = size_of_block * sizeof(int);
    for (i = 0; i < size_of_block; i++)
    {
        message_body[i] = i;
    }

    /* set up an interval timer to have 100000 seconds */
    timevalue.it_value.tv_sec = 100000;
    timevalue.it_value.tv_usec = 0;
    timevalue.it_interval.tv_sec = 0;    /* one-shot timer, no reload */
    timevalue.it_interval.tv_usec = 0;
    if (setitimer(ITIMER_REAL, &timevalue, &oldtimevalue) < 0)
    {
        perror("Set time");
    }

    /* now get the current time */
    if (getitimer(ITIMER_REAL, &oldtimevalue) < 0)
    {
        perror("First getitimer");
    }

    /* send the message header */
    message_header_size = MESSAGE_HEADER_SIZE;
    header_body[0] = VARIABLE_HEADER_TAG;
    header_body[1] = message_body_size;
    if (write(sock, header_body, message_header_size) < 0)
    {
        perror("Writing header on stream socket");
        exit(1);
    }

    for (i = 0; i < number_of_blocks; i++)
    {
        /* send the message buffer */
        if (write(sock, message_body, message_body_size) < 0)
        {
            perror("Writing message on stream socket");
            printf("Tried to send %d bytes last integer = %d \n",
                   message_body_size,
                   message_body[(message_body_size / 4) - 1]);
            exit(1);
        }

        /* receive a reply */
        if (read(sock, message_body, message_body_size) < 0)
        {
            perror("Reading message on stream socket");
            exit(1);
        }

        /* diagnostic of the communication loop ***********************/
        if (diagnostic_switch < 0)
        {
            comm_error = 0;
            for (j = 0; j < size_of_block; j++)
            {
                if (message_body[j] != j)
                    comm_error = comm_error + 1;
            }
            printf("block %d size %d error %d \n",
                   i, message_body_size, comm_error);
        }
        if (diagnostic_switch > 0)
        {
            /* empty loop used as a crude delay */
            for (j = 0; j < diagnostic_switch; j++);
        }
        /**************************************************************/
    }   /* end communication loop */

    /* now get the current time */
    if (getitimer(ITIMER_REAL, &timevalue) < 0)
    {
        perror("Second getitimer");
    }

    /* the timer counts down, so elapsed time is old value minus new value */
    timedif_sec = oldtimevalue.it_value.tv_sec -
        timevalue.it_value.tv_sec;
    timedif_usec = oldtimevalue.it_value.tv_usec -
        timevalue.it_value.tv_usec;
    timedif = (float) timedif_sec + ((float) timedif_usec / 1000000);

    /* evaluate the communication */
    comm_error = 0;
    for (i = 0; i < size_of_block; i++)
    {
        if (message_body[i] != i)
            comm_error = comm_error + 1;
    }

    /* print results to terminal */
    printf("Sent and received %d blocks of %d bytes with %d errors. \n",
           number_of_blocks, size_of_block * 4, comm_error);
    printf("Total transfer of %d bytes with %d errors in %f seconds \n"
           "%f MB/sec. \n",
           number_of_blocks * 2 * size_of_block * 4, comm_error,
           timedif,
           (((float) (number_of_blocks) * 2 * size_of_block * 4) /
            timedif) / 1000000.0);

    /* close the socket */
    if (close(sock) < 0)
    {
        perror("Socket close error");
        exit(1);
    } else
    {
        exit(0);
    }
}


APPENDIX C. RAY TRACE BENCHMARK


/***************************************************************************
FILENAME: [illegible in the scanned source]
PURPOSE:  ray trace benchmark code
          Tests the speed of the basic inverse ray trace inner loop
***************************************************************************/

#define MAX_STEP 1000
#define MAX_RAY 50000


main() 
{ 
nage eysl eyo cy: 
int terrain [MAX Stee ore, 
int dew, 62. 77-2. 
int e,n,uwe,uwn; 
TW ot ee Meas 
/* initaalaze Variables s 


for( 1=073 3 Vax TBP ear 
For{ 3-0; 30 Mexs ee wae 
terrains |i 7) =" 50 0sces2c- 


/* initialize arrecticm cosines 


X= .V66.2 26555 540 
(Tne (cr Oe 2G 6 
Gn = (imb ei aeOeoe 
az = (mt) 2)" 6o5cq). 
/*Trace MEX RAY (rays chee) ee Wee eo aie oe ee 
FOr 2-07 ea) ee eee 

{ 

/* Initialize trace Start cence ronse 

Zr = 0006S 556- 


OF 
1 
| 


        e = 0;
        n = 0;
        do
        {
            /* increment ray coordinates */
            zr -= dz;
            e += de;
            n += dn;

            /* extract upper 16 bits of coordinates */
            uwe = e >> 16;
            uwn = n >> 16;
            zt = terrain[uwe][uwn];
} whi le(s27 2 
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